IBM HR Analytics Employee Attrition & Performance¶
PREPARED BY : Raj Purohith Arjun¶
UIN : 535005964¶
STAT 650¶
1. Overview¶
Research Background¶
The IBM HR Analytics Employee Attrition & Performance dataset includes various factors related to employee demographics, job satisfaction, work-life balance, and performance metrics. Researchers and organizations might be interested in this dataset to understand the factors influencing employee attrition. This analysis can help improve employee retention strategies, enhance job satisfaction, and optimize organizational performance.
Objective¶
The primary objective of this project is to analyze the factors influencing employee attrition within an organization by examining variables such as employee age, gender, monthly income, distance from home, job role, and various satisfaction levels. Specifically, we aim to investigate how factors like job satisfaction, environment satisfaction, work-life balance, and performance ratings contribute to the likelihood of employees leaving the organization. Using statistical and machine learning techniques, we will identify significant predictors of attrition and develop a model to estimate the probability of an employee attriting. This analysis will provide actionable insights for improving employee retention and optimizing organizational performance.
Research Questions¶
In addition to the primary objective of identifying factors influencing employee attrition and predicting attrition probability, this project also addresses the following research questions:
- What demographic and workplace characteristics (e.g., age, gender, job role) are most strongly associated with employee attrition?
- How does an employee's monthly income and distance from home influence their probability of attrition?
- Can a statistical or machine learning model accurately predict employee attrition based on the available workplace and performance data?
- Are certain job roles or departments more prone to higher attrition rates?
- What is the impact of performance ratings on employee retention, and how does this vary across different levels of job satisfaction?
Statistical Hypothesis¶
The primary goal of this analysis is to determine if there are significant relationships between employee characteristics, job-related factors, and the likelihood of attrition. The statistical hypothesis can be formulated as follows:
Null Hypothesis (H0): There is no significant relationship between employee attributes (such as age, job satisfaction, monthly income, distance from home, work-life balance, and performance rating) and employee attrition.
Alternative Hypothesis (H1): There is a significant relationship between employee attributes and employee attrition.
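As a sketch of how H0 could be tested for a single categorical attribute, a chi-square test of independence can be computed directly from a contingency table. The counts below are a hypothetical toy sample, purely for illustration; the actual analysis would use the full dataset loaded later in this report.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample (illustration only, not the real data)
sample = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"],
    "Attrition": ["Yes", "Yes", "No", "No", "No",  "No", "No", "Yes"],
})
observed = pd.crosstab(sample["OverTime"], sample["Attrition"]).to_numpy()

# Expected counts under H0 (independence), then the chi-square statistic
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# For dof = 1 the 5% critical value is about 3.841; chi2 above it rejects H0
print(f"chi2 = {chi2:.3f}, dof = {dof}")  # chi2 = 4.800, dof = 1
```

On this toy sample the statistic exceeds the critical value, so H0 would be rejected; with the real data the same computation (or `scipy.stats.chi2_contingency`) applies per attribute.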
2. Dataset Description¶
This dataset focuses on employee information and workplace attributes, with the goal of analyzing employee attrition, a key factor in understanding why employees leave the organization. By studying the various factors affecting attrition, the goal is to develop strategies to improve employee retention and organizational performance.
Data Set Source: Kaggle - IBM HR Analytics Employee Attrition & Performance Dataset
Dataset Characteristics:¶
- Rows (Observations): 1,470 employees
- Columns (Variables): The dataset contains 35 variables. Below is a full list of the variables included:
Variables Description¶
Dependent Variable (Target):¶
- Attrition: This is the target variable, indicating whether the employee has left the organization (Yes = Attrited, No = Not Attrited).
Independent Variables:¶
These variables are used to predict or analyze the likelihood of attrition:
- Age: The age of the employee (numeric).
- Gender: The employee’s gender (Male/Female).
- BusinessTravel: Frequency of business travel (categorical).
- DailyRate: Daily rate of the employee (numeric).
- DistanceFromHome: Distance between the employee's home and workplace (numeric).
- Education: The highest level of education completed (numeric: 1 to 5).
- EducationField: Field of education (Life Sciences, Medical, Other, Marketing, Technical Degree).
- EmployeeCount: Number of employees represented by the row (constant value of 1 for every record).
- EmployeeNumber: Unique identifier for each employee.
- EnvironmentSatisfaction: Satisfaction with the work environment (ordinal).
- HourlyRate: Hourly rate of the employee (numeric).
- JobInvolvement: Employee's involvement in their job (ordinal).
- JobLevel: Job level (numeric).
- JobRole: The role or job title of the employee (Sales Executive, Research Scientist, etc.).
- JobSatisfaction: Employee's satisfaction with their job (ordinal).
- MaritalStatus: Marital status of the employee (Single, Married, Divorced).
- MonthlyIncome: Monthly income of the employee (numeric).
- MonthlyRate: Monthly rate of the employee (numeric).
- NumCompaniesWorked: Number of companies the employee has worked for (numeric).
- OverTime: Whether the employee works overtime (categorical: Yes, No).
- PercentSalaryHike: Percentage salary increase (numeric).
- PerformanceRating: Performance rating of the employee (ordinal).
- RelationshipSatisfaction: Satisfaction with workplace relationships (ordinal).
- StandardHours: Standard hours (constant).
- StockOptionLevel: Level of stock options (numeric).
- TotalWorkingYears: Total years of work experience (numeric).
- TrainingTimesLastYear: Number of training sessions in the past year (numeric).
- WorkLifeBalance: Employee's work-life balance (ordinal).
- YearsAtCompany: Number of years the employee has worked at the company (numeric).
- YearsInCurrentRole: Number of years in the current role (numeric).
- YearsSinceLastPromotion: Number of years since last promotion (numeric).
- YearsWithCurrManager: Number of years with the current manager (numeric).
Irrelevant Variables:¶
- EmployeeCount: Constant value across rows, irrelevant to attrition analysis.
- StandardHours: Constant value for all employees, adds no value to prediction.
- EmployeeNumber: Unique identifier, irrelevant to attrition behavior.
Information about Dataset¶
# Load the dataset
import pandas as pd
df=pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
# Display the first few rows of the dataset in a visually appealing format
print("Dataset Loaded Successfully!")
display(df.head())
Dataset Loaded Successfully!
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
5 rows × 35 columns
# Display basic information about the dataset
print("\nDataset Information:")
df.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1470 non-null   int64
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64
 6   Education                 1470 non-null   int64
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64
 9   EmployeeNumber            1470 non-null   int64
 10  EnvironmentSatisfaction   1470 non-null   int64
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64
 13  JobInvolvement            1470 non-null   int64
 14  JobLevel                  1470 non-null   int64
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64
 19  MonthlyRate               1470 non-null   int64
 20  NumCompaniesWorked        1470 non-null   int64
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64
 24  PerformanceRating         1470 non-null   int64
 25  RelationshipSatisfaction  1470 non-null   int64
 26  StandardHours             1470 non-null   int64
 27  StockOptionLevel          1470 non-null   int64
 28  TotalWorkingYears         1470 non-null   int64
 29  TrainingTimesLastYear     1470 non-null   int64
 30  WorkLifeBalance           1470 non-null   int64
 31  YearsAtCompany            1470 non-null   int64
 32  YearsInCurrentRole        1470 non-null   int64
 33  YearsSinceLastPromotion   1470 non-null   int64
 34  YearsWithCurrManager      1470 non-null   int64
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
3. Data Pre-Processing¶
Data pre-processing is an important step in preparing the dataset for analysis. This step includes handling missing values, encoding categorical variables, and scaling numerical features.
3.1 Handling Missing Values and Duplicates¶
First, we check for any missing values in the dataset. If there are any, we decide whether to impute them (using mean, median, or mode) or remove them.
# Check for missing values
missing_values = df.isnull().sum()
# Display missing values count for each column
print("Missing Values in Each Column:")
print(missing_values)
Missing Values in Each Column:
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64
Explanation:¶
- Checking for missing values with isnull().sum() confirms that the dataset does not contain any. If there were missing values, we would impute them (using the mean, median, or mode) or remove the affected rows.
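Since this dataset has no missing values, no imputation is needed here; as a sketch of what it would look like, the toy frame below (hypothetical values, not from the dataset) imputes a numeric column with its median and a categorical column with its mode:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with deliberate gaps (illustration only)
toy = pd.DataFrame({
    "MonthlyIncome": [5000.0, np.nan, 7000.0, 6000.0],
    "JobRole": ["Sales Executive", None, "Research Scientist", "Sales Executive"],
})

# Numeric column: impute with the median (robust to outliers)
toy["MonthlyIncome"] = toy["MonthlyIncome"].fillna(toy["MonthlyIncome"].median())
# Categorical column: impute with the mode (most frequent category)
toy["JobRole"] = toy["JobRole"].fillna(toy["JobRole"].mode()[0])

print(toy)  # both gaps are now filled
```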
# Check for duplicates in the dataset
duplicates = df.duplicated()
# Display the number of duplicate rows
num_duplicates = duplicates.sum()
print(f'Number of duplicate rows: {num_duplicates}')
# If duplicates exist, we can drop them
if num_duplicates > 0:
# Drop duplicate rows
df = df.drop_duplicates()
print("Duplicates have been removed from the dataset.")
else:
print("No duplicates found in the dataset.")
Number of duplicate rows: 0
No duplicates found in the dataset.
3.2 Encoding Categorical Variables¶
3.2.1 Label Encoding for Ordinal Variables:¶
Ordinal variables like EnvironmentSatisfaction, JobInvolvement, JobSatisfaction, PerformanceRating, RelationshipSatisfaction, and WorkLifeBalance can be encoded using Label Encoding.
Ordinal Variables:
EnvironmentSatisfaction: Ordered levels ("Low", "Medium", "High", "Very High").
JobInvolvement: Ordered levels ("Low", "Medium", "High", "Very High").
JobSatisfaction: Ordered levels ("Low", "Medium", "High", "Very High").
PerformanceRating: Ordered levels ("Low", "Good", "Excellent", "Outstanding").
RelationshipSatisfaction: Ordered levels ("Low", "Medium", "High", "Very High").
WorkLifeBalance: Ordered levels ("Bad", "Good", "Better", "Best").
This encoding assigns an integer to each unique category, maintaining the ordinal relationship.
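One caveat: LabelEncoder assigns codes in sorted order, which happens to preserve the ranking here only because these columns are already stored as integers (1-4). If the levels were string labels, alphabetical sorting would scramble the order, and an explicit mapping would be safer. A minimal sketch with a hypothetical string-coded column:

```python
import pandas as pd

# Hypothetical string-labeled column; the actual dataset stores these as 1-4
toy = pd.DataFrame({"WorkLifeBalance": ["Bad", "Best", "Good", "Better"]})

# An explicit mapping guarantees the encoded integers follow the true order,
# whereas LabelEncoder would sort alphabetically (Bad, Best, Better, Good)
order = {"Bad": 0, "Good": 1, "Better": 2, "Best": 3}
toy["WorkLifeBalance"] = toy["WorkLifeBalance"].map(order)

print(toy["WorkLifeBalance"].tolist())  # [0, 3, 1, 2]
```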
# List of ordinal columns
from sklearn.preprocessing import LabelEncoder
ordinal_columns = ['EnvironmentSatisfaction', 'JobInvolvement', 'JobSatisfaction',
'PerformanceRating', 'RelationshipSatisfaction', 'WorkLifeBalance']
# Initialize the LabelEncoder
label_encoder = LabelEncoder()
# Apply Label Encoding for each ordinal column
for col in ordinal_columns:
df[col] = label_encoder.fit_transform(df[col])
# Display a sample of the DataFrame after label encoding
df[ordinal_columns].head()
| EnvironmentSatisfaction | JobInvolvement | JobSatisfaction | PerformanceRating | RelationshipSatisfaction | WorkLifeBalance | |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 0 | 0 | 0 |
| 1 | 2 | 1 | 1 | 1 | 3 | 2 |
| 2 | 3 | 1 | 2 | 0 | 1 | 2 |
| 3 | 3 | 2 | 2 | 0 | 2 | 2 |
| 4 | 0 | 2 | 1 | 0 | 3 | 2 |
3.2.2 One-Hot Encoding for Nominal Variables:¶
Nominal Variables:
BusinessTravel: Categories (e.g., "Travel_Frequently", "Non-Travel") have no inherent order.
Department: Different departments (e.g., "Sales", "HR") with no rank or order.
EducationField: Different fields (e.g., "Life Sciences", "Marketing") without order.
Gender: Categories ("Male", "Female") with no ranking.
JobRole: Different job titles (e.g., "Manager", "Sales Executive") with no inherent order.
MaritalStatus: Categories ("Single", "Married") with no order.
OverTime: Binary categories ("Yes", "No") with no ranking.
StockOptionLevel: Different levels (0, 1, 2, 3) representing categories rather than a numeric scale.
import pandas as pd
# List of nominal columns (categorical columns)
nominal_columns = ['BusinessTravel', 'Department', 'EducationField', 'Gender',
'JobRole', 'MaritalStatus', 'OverTime', 'StockOptionLevel']
# Apply One-Hot Encoding for nominal columns
df_encoded = pd.get_dummies(df, columns=nominal_columns, drop_first=True)
# Display the updated DataFrame with one-hot encoded columns
df_encoded.head()
| Age | Attrition | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | ... | JobRole_Research Director | JobRole_Research Scientist | JobRole_Sales Executive | JobRole_Sales Representative | MaritalStatus_Married | MaritalStatus_Single | OverTime_Yes | StockOptionLevel_1 | StockOptionLevel_2 | StockOptionLevel_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | 1102 | 1 | 2 | 1 | 1 | 1 | 94 | 2 | ... | False | False | True | False | False | True | True | False | False | False |
| 1 | 49 | No | 279 | 8 | 1 | 1 | 2 | 2 | 61 | 1 | ... | False | True | False | False | True | False | False | True | False | False |
| 2 | 37 | Yes | 1373 | 2 | 2 | 1 | 4 | 3 | 92 | 1 | ... | False | False | False | False | False | True | True | False | False | False |
| 3 | 33 | No | 1392 | 3 | 4 | 1 | 5 | 3 | 56 | 2 | ... | False | True | False | False | True | False | True | False | False | False |
| 4 | 27 | No | 591 | 2 | 1 | 1 | 7 | 0 | 40 | 2 | ... | False | False | False | False | True | False | False | True | False | False |
5 rows × 51 columns
Explanation:
Label Encoding for Ordinal Columns:
- Purpose: Convert ordinal categories (with a meaningful order) into integers.
- Columns: ['EnvironmentSatisfaction', 'JobInvolvement', 'JobSatisfaction', 'PerformanceRating', 'RelationshipSatisfaction', 'WorkLifeBalance']
- Method: Used LabelEncoder to transform categorical values into integers (e.g., Low = 0, Medium = 1, High = 2).
- Result: The ordinal columns are now encoded as numeric values.
One-Hot Encoding for Nominal Columns:
- Purpose: Convert nominal categories (without order) into binary columns.
- Columns: ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime', 'StockOptionLevel']
- Method: Applied pd.get_dummies() to create binary columns for each category (e.g., BusinessTravel_Travel_Frequently = 1 if the employee frequently travels).
- Result: Nominal columns expanded into multiple binary columns, each representing a category.
Summary:
- Label Encoding: For ordinal columns (values with order).
- One-Hot Encoding: For nominal columns (values without order).
3.3 Scaling and Normalization¶
This normalization ensures that the model's training is not biased toward variables with larger scales, improving model performance and convergence.
from sklearn.preprocessing import StandardScaler
# List of numerical columns to be standardized
numerical_columns = ['MonthlyIncome', 'MonthlyRate']
# Initialize StandardScaler (Z-score normalization)
scaler = StandardScaler()
# Apply the scaler to the numerical columns
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
# Show the first few rows of the transformed numerical columns
print("\nFirst few rows of the scaled numerical columns:")
print(df[numerical_columns].head())
# Check the shape of the dataset after processing
print("\nShape of the dataset after processing:", df.shape)
First few rows of the scaled numerical columns:
   MonthlyIncome  MonthlyRate
0      -0.108350     0.726020
1      -0.291719     1.488876
2      -0.937654    -1.674841
3      -0.763634     1.243211
4      -0.644858     0.325900

Shape of the dataset after processing: (1470, 35)
Explanation
Standardization (Z-Score Normalization): The StandardScaler was applied to the numerical columns ['MonthlyIncome', 'MonthlyRate']. This technique transforms the data so that each feature has a mean of 0 and a standard deviation of 1, making the data comparable across different scales.
Transformation: The scaler fits the data and then transforms the values. After transformation, each value in these columns represents how many standard deviations it is away from the mean of that feature.
Output:
The first few rows show the scaled values for MonthlyIncome and MonthlyRate. For example, a value of -0.108350 for MonthlyIncome means that the original value is slightly below the mean of that feature. Similarly, 0.726020 for MonthlyRate indicates that it is above the mean, but less than one standard deviation.
Shape of the Dataset: The shape of the dataset is (1470, 35), meaning that after transformation, the dataset still contains 1470 rows and 35 columns, with only the numerical columns being scaled.
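The standardization formula is z = (x − μ) / σ, where μ and σ are the column's mean and standard deviation. A quick sketch with hypothetical income values shows the resulting column has mean 0 and standard deviation 1:

```python
import numpy as np

# Hypothetical incomes; StandardScaler uses the population std (ddof=0)
incomes = np.array([4000.0, 6000.0, 8000.0])
mu, sigma = incomes.mean(), incomes.std(ddof=0)

# Z-score: how many standard deviations each value sits from the mean
z = (incomes - mu) / sigma
print(z)  # symmetric values around 0
```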
Dropping Irrelevant Columns¶
# Drop constant and identifier columns; PerformanceRating is also dropped here
df_cleaned = df.drop(columns=['EmployeeCount', 'StandardHours', 'PerformanceRating', 'EmployeeNumber'])
# Check the shape of the dataset after dropping columns
print(df_cleaned.shape)
(1470, 31)
4. Exploratory Data Analysis (EDA)¶
4.1 Univariate Analysis¶
4.1.1 Summary Statistics¶
Summary statistics offer a numerical snapshot of the dataset's characteristics. Key metrics include:
Mean
This represents the average value of a variable, aiding in the identification of central tendency. It reflects the dataset's size and highlights any missing entries.
Median
The midpoint value that provides insights into the distribution, particularly in cases of skewed data. This metric serves as a robust indicator of central tendency, being less influenced by outliers.
Mode
The value that appears most frequently in the dataset, shedding light on prevalent traits. It identifies common values and trends within the data.
Standard Deviation
A measure of the data's variability or dispersion relative to the mean. This statistic helps assess how spread out the data is; a higher standard deviation indicates greater variability.
Minimum (min)
The lowest value in a dataset for a given variable, offering insights into the lower boundary and assisting in identifying outliers. It provides context for the lower limits of the data.
25th Percentile (Q1)
The threshold below which 25% of the data falls, useful for examining the lower end of the distribution. This metric is instrumental in understanding the lower range of data.
75th Percentile (Q3)
The threshold below which 75% of the data lies, helping to assess the upper end of the distribution and detect potential skewness. This value aids in grasping the upper range of data distribution.
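These metrics can be computed directly with pandas; a sketch on a small hypothetical series of ages (not drawn from the dataset):

```python
import pandas as pd

# Hypothetical ages, already sorted for readability
ages = pd.Series([18, 25, 30, 36, 43, 55, 60])

# Q1, median, and Q3 via linear interpolation (pandas' default)
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])

print(q1, median, q3)  # 27.5 36.0 49.0
```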
Insights
The summary statistics can illuminate the distribution characteristics of the features, revealing whether they are normally distributed, skewed, or contain outliers. For instance, if the mean age of employees is significantly higher than the median, it indicates a right-skewed distribution, with a tail of older employees pulling the mean upward. Similarly, if the standard deviation is high, it suggests that there is substantial variability in certain variables (e.g., income, job satisfaction), and identifying these patterns can help in understanding factors related to employee attrition.
df_cleaned.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 1470.0 | 3.692381e+01 | 9.135373 | 18.000000 | 30.000000 | 36.000000 | 43.000000 | 60.000000 |
| DailyRate | 1470.0 | 8.024857e+02 | 403.509100 | 102.000000 | 465.000000 | 802.000000 | 1157.000000 | 1499.000000 |
| DistanceFromHome | 1470.0 | 9.192517e+00 | 8.106864 | 1.000000 | 2.000000 | 7.000000 | 14.000000 | 29.000000 |
| Education | 1470.0 | 2.912925e+00 | 1.024165 | 1.000000 | 2.000000 | 3.000000 | 4.000000 | 5.000000 |
| EnvironmentSatisfaction | 1470.0 | 1.721769e+00 | 1.093082 | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 3.000000 |
| HourlyRate | 1470.0 | 6.589116e+01 | 20.329428 | 30.000000 | 48.000000 | 66.000000 | 83.750000 | 100.000000 |
| JobInvolvement | 1470.0 | 1.729932e+00 | 0.711561 | 0.000000 | 1.000000 | 2.000000 | 2.000000 | 3.000000 |
| JobLevel | 1470.0 | 2.063946e+00 | 1.106940 | 1.000000 | 1.000000 | 2.000000 | 3.000000 | 5.000000 |
| JobSatisfaction | 1470.0 | 1.728571e+00 | 1.102846 | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 3.000000 |
| MonthlyIncome | 1470.0 | -4.471102e-17 | 1.000340 | -1.167343 | -0.763209 | -0.336552 | 0.398625 | 2.867626 |
| MonthlyRate | 1470.0 | 3.021015e-17 | 1.000340 | -1.717284 | -0.880644 | -0.010906 | 0.864101 | 1.782888 |
| NumCompaniesWorked | 1470.0 | 2.693197e+00 | 2.498009 | 0.000000 | 1.000000 | 2.000000 | 4.000000 | 9.000000 |
| PercentSalaryHike | 1470.0 | 1.520952e+01 | 3.659938 | 11.000000 | 12.000000 | 14.000000 | 18.000000 | 25.000000 |
| RelationshipSatisfaction | 1470.0 | 1.712245e+00 | 1.081209 | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 3.000000 |
| StockOptionLevel | 1470.0 | 7.938776e-01 | 0.852077 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 3.000000 |
| TotalWorkingYears | 1470.0 | 1.127959e+01 | 7.780782 | 0.000000 | 6.000000 | 10.000000 | 15.000000 | 40.000000 |
| TrainingTimesLastYear | 1470.0 | 2.799320e+00 | 1.289271 | 0.000000 | 2.000000 | 3.000000 | 3.000000 | 6.000000 |
| WorkLifeBalance | 1470.0 | 1.761224e+00 | 0.706476 | 0.000000 | 1.000000 | 2.000000 | 2.000000 | 3.000000 |
| YearsAtCompany | 1470.0 | 7.008163e+00 | 6.126525 | 0.000000 | 3.000000 | 5.000000 | 9.000000 | 40.000000 |
| YearsInCurrentRole | 1470.0 | 4.229252e+00 | 3.623137 | 0.000000 | 2.000000 | 3.000000 | 7.000000 | 18.000000 |
| YearsSinceLastPromotion | 1470.0 | 2.187755e+00 | 3.222430 | 0.000000 | 0.000000 | 1.000000 | 3.000000 | 15.000000 |
| YearsWithCurrManager | 1470.0 | 4.123129e+00 | 3.568136 | 0.000000 | 2.000000 | 3.000000 | 7.000000 | 17.000000 |
Insights from Summary Statistics:
Age:
- Mean age is approximately 37 years, with a wide range from 18 to 60 years.
- Indicates a diverse workforce in terms of age distribution.
DailyRate:
- Average daily rate is 802, with substantial variability (std ~ 403).
- Suggests a significant difference in daily pay among employees.
DistanceFromHome:
- Employees live an average of 9.19 units (e.g., miles) from work, with a range of 1 to 29 units.
- This could influence work-life balance or commuting patterns.
Education:
- Median education level is 3, suggesting most employees have a bachelor's or similar level of education.
- Spans from 1 (lowest) to 5 (highest), indicating a mix of educational backgrounds.
EnvironmentSatisfaction:
- Average satisfaction is 1.72 (scale likely 0-3).
- Shows room for improvement in workplace satisfaction.
HourlyRate:
- Mean hourly rate is approximately 66, ranging from 30 to 100.
- Highlights variability in compensation.
JobInvolvement:
- Average score is 1.73 (likely scale 0-3), indicating moderate involvement.
- Potential area for initiatives to increase engagement.
JobLevel:
- Median level is 2, with a range from 1 to 5.
- Reflects a typical hierarchical structure.
JobSatisfaction:
- Average satisfaction is 1.73 (scale likely 0-3).
- Similar to environment satisfaction, opportunities for improvement exist.
MonthlyIncome and MonthlyRate:
- Both variables have been standardized (mean ~ 0, std ~ 1).
- Direct interpretation of raw values is not meaningful without context.
NumCompaniesWorked:
- Employees have worked at an average of 2.7 companies, with a wide range (0-9).
- Reflects varying levels of industry experience.
PercentSalaryHike:
- Average increase is 15%, with a range of 11% to 25%.
- Reflects annual appraisal increments.
RelationshipSatisfaction:
- Similar trends to job satisfaction; room for improvement exists.
StockOptionLevel:
- Median level is 1, with some employees receiving higher stock options.
- Indicates variation in benefits offered.
TotalWorkingYears:
- Mean total years worked is 11.28, with a range of 0 to 40.
- Suggests a mix of experienced and early-career employees.
TrainingTimesLastYear:
- Median is 3, indicating moderate training frequency.
- Variability shows differences in training opportunities.
WorkLifeBalance:
- Average score is 1.76 (scale likely 0-3).
- Highlights a potential area for improvement in employee well-being.
YearsAtCompany:
- Median tenure is 5 years, ranging from 0 to 40.
- Reflects varying degrees of employee loyalty and tenure.
YearsInCurrentRole:
- Median is 3 years, with some employees holding roles for up to 18 years.
- Indicates a mix of recently and long-tenured role holders.
YearsSinceLastPromotion:
- Median is 1 year, but some employees haven't been promoted for up to 15 years.
- Reflects a potential issue with career progression for some employees.
YearsWithCurrManager:
- Median is 3 years, indicating moderate stability in reporting relationships.
4.1.2 Visualization¶
This section focuses on analyzing quantitative variables to identify patterns related to employee attrition. Histograms help visualize the frequency distributions of variables such as age, daily rate, and distance from home, revealing their spread and any potential skewness. For instance, the histogram of Age may highlight a concentration of employees in their mid-30s to early 40s. Box plots are used to detect outliers in features like MonthlyIncome and JobSatisfaction, offering insights into anomalies or extreme values. Density plots provide smooth visualizations of distributions, such as YearsAtCompany, helping to understand where most employees fall in terms of tenure and identifying trends that may correlate with attrition. Together, these visual tools are crucial for uncovering trends and outliers, guiding subsequent analysis and modeling.
Histogram¶
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# List of numerical columns to perform univariate analysis on
numerical_cols = [
'Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome','TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany'
]
# Number of columns for subplots (adjust based on your preference)
n_cols = 3
# Number of rows based on the number of columns
n_rows = len(numerical_cols) // n_cols + (len(numerical_cols) % n_cols > 0)
# Setting up the plotting area
fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 5 * n_rows)) # Adjust height as needed
fig.subplots_adjust(hspace=0.5, wspace=0.3) # Adjust the space between subplots
# Loop through and create histogram for each column
for i, col in enumerate(numerical_cols):
row = i // n_cols # Get the row index
col_index = i % n_cols # Get the column index
sns.histplot(df[col], bins=30, kde=False, ax=axes[row, col_index])
axes[row, col_index].set_title(f'Histogram of {col}')
axes[row, col_index].set_xlabel(col)
axes[row, col_index].set_ylabel('Frequency')
# Remove empty subplots if there are any
for j in range(i + 1, n_rows * n_cols):
fig.delaxes(axes[j // n_cols, j % n_cols])
# Display the plots
plt.show()
Key Insights and Findings
Demographic Profile
The age distribution shows a bell-shaped curve centered around 30-35 years, indicating a predominantly young to middle-aged workforce. Most employees have an education level of 3.0, suggesting a well-educated workforce, with another significant group at level 4.0.
Work Environment Characteristics
Distance and Location
The distance-from-home histogram reveals that a majority of employees live within 10 units of their workplace, with the highest concentration within 0-5 units. This suggests a locally concentrated workforce.
Work Experience and Tenure
Total working years peak around 10-15 years, indicating a mid-career dominant workforce.
Years at company shows a concerning trend with most employees having less than 5 years of tenure, potentially signaling high turnover.
Professional Development
The training frequency histogram shows that most employees received 2-3 training sessions in the last year. This moderate level of professional development investment could be enhanced to improve retention.
Satisfaction Metrics
Job Satisfaction
The distribution shows an interesting bimodal pattern with peaks at levels 1 and 3, suggesting a polarized workforce in terms of job satisfaction. This could be a critical indicator for potential attrition risks.
Work-Life Balance
The work-life balance metric shows a strong concentration around levels 2 and 3, indicating generally positive perceptions among employees.
Compensation Structure
The monthly income histogram displays a right-skewed distribution, with a large concentration of employees at the lower end of the pay scale. This salary distribution pattern, combined with the satisfaction metrics, could be a contributing factor to potential attrition.
These insights suggest several areas requiring attention, particularly around employee retention strategies, compensation structure, and job satisfaction improvement initiatives.
Box Plot¶
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Assuming the dataset is loaded into 'df'
# List of numerical columns to perform univariate analysis on
numerical_cols = [
'Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome','TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany'
]
# Number of columns for subplots (adjust based on your preference)
n_cols = 4 # You can change this value for a different number of columns per row
# Number of rows based on the number of columns
n_rows = len(numerical_cols) // n_cols + (len(numerical_cols) % n_cols > 0)
# Setting up the plotting area for box plots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 5 * n_rows)) # Adjust height as needed
fig.subplots_adjust(hspace=0.5, wspace=0.3) # Adjust the space between subplots
# Loop through and create boxplot for each column
for i, col in enumerate(numerical_cols):
row = i // n_cols # Get the row index
col_index = i % n_cols # Get the column index
# Create Boxplot
sns.boxplot(x=df[col], ax=axes[row, col_index], color='orange')
axes[row, col_index].set_title(f'Boxplot of {col}')
axes[row, col_index].set_xlabel(col)
# Remove empty subplots if there are any
for j in range(i + 1, n_rows * n_cols):
fig.delaxes(axes[j // n_cols, j % n_cols])
# Display the plots
plt.show()
Key Insights and Findings
Distribution Analysis
Age Distribution
The boxplot shows a median age around 35 years, with the interquartile range spanning roughly from 30 to 40 years. The whiskers extend from approximately 20 to 60 years, indicating a well-distributed age range with no significant outliers.
Distance From Home
A right-skewed distribution is evident with the median around 7-8 units. The box (IQR) shows most employees live within 5-15 units from work, while the extended upper whisker indicates some employees commute from up to 30 units away.
Education and Job Satisfaction
Education levels show a compact box between levels 2-4, suggesting consistent educational qualifications across the workforce.
Job satisfaction displays an even distribution across all four levels (0-3), with the box spanning almost the entire range.
Career Metrics
Monthly Income
The income boxplot reveals significant right skewing with numerous high-income outliers. The main body of the distribution (box) is concentrated in the lower income ranges, suggesting a large gap between average and top earners.
Experience Indicators
Total Working Years shows a median around 10 years, with several outliers extending beyond 30 years.
Years at Company displays a heavily right-skewed distribution with numerous outliers beyond 20 years, while the majority (box) remains under 10 years.
Professional Development
Training Times Last Year shows a compact distribution with the box centered around 2-3 sessions, with a few outliers receiving up to 6 training sessions.
Work-Life Balance
The distribution is relatively symmetric across the 0-3 scale, with the box spanning levels 1-2, indicating moderate satisfaction with work-life balance across the workforce.
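The boxplot whiskers above follow Matplotlib/Seaborn's default 1.5×IQR rule, so the "outliers" called out for Monthly Income and Years at Company are points beyond those fences. A minimal sketch of that rule is shown below, using a small synthetic frame (`demo`) as a stand-in for the real `df`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the HR DataFrame 'df' (illustrative values only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'MonthlyIncome': rng.lognormal(mean=8.5, sigma=0.5, size=500),  # right-skewed
    'Age': rng.normal(37, 9, size=500).clip(18, 60),                # roughly symmetric
})

def iqr_outlier_count(series: pd.Series) -> int:
    """Count points beyond the 1.5*IQR whiskers, as a default boxplot draws them."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

for col in demo.columns:
    print(col, iqr_outlier_count(demo[col]))
```

Applying the same helper to the real columns would quantify how many employees fall outside each boxplot's whiskers rather than relying on visual inspection alone.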
Density Plot¶
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming the dataset is loaded into 'df'
# List of numerical columns to perform univariate analysis on
numerical_cols = [
    'Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome',
    'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
    'YearsAtCompany'
]

# Number of columns for subplots (adjust based on your preference)
n_cols = 4
# Number of rows based on the number of columns
n_rows = len(numerical_cols) // n_cols + (len(numerical_cols) % n_cols > 0)

# Set up the plotting area for density plots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 5 * n_rows))  # Adjust height as needed
fig.subplots_adjust(hspace=0.5, wspace=0.3)  # Adjust the space between subplots

# Loop through and create a density plot for each column
for i, col in enumerate(numerical_cols):
    row = i // n_cols        # Row index in the subplot grid
    col_index = i % n_cols   # Column index in the subplot grid
    sns.kdeplot(df[col], ax=axes[row, col_index], color='green')
    axes[row, col_index].set_title(f'Density Plot of {col}')
    axes[row, col_index].set_xlabel(col)
    axes[row, col_index].set_ylabel('Density')

# Remove empty subplots if there are any
for j in range(len(numerical_cols), n_rows * n_cols):
    fig.delaxes(axes[j // n_cols, j % n_cols])

# Display the plots
plt.show()
Key Insights and Findings¶
Demographic Distribution
Age Profile
The age density plot shows a unimodal distribution with peak density around 35-40 years. The curve exhibits a slight right skew, with a gradual tail extending to 60 years and a sharp drop-off below 25 years.
Educational Pattern
The education density reveals multiple peaks, with the highest at level 3, followed by significant peaks at levels 2 and 4. This multimodal distribution suggests distinct educational qualification clusters within the workforce.
Work-Related Metrics
Distance Distribution
The distance-from-home plot shows a rapidly declining curve with:
- Highest density at 0-5 units
- Secondary peak around 10 units
- Long tail extending to 30 units
Career Metrics
Total Working Years displays a right-skewed distribution peaking at 10 years.
Years at Company shows a sharp peak at 5-7 years with rapid decline thereafter.
Satisfaction and Development
Job Satisfaction
The distribution is notably bimodal with:
- Major peaks at levels 2 and 3
- Minor peaks at levels 0 and 1
- Similar density heights for the major peaks
Work-Life Balance
Shows a distinctive trimodal distribution:
- Dominant peak at level 2
- Secondary peak at level 1
- Smaller peak at level 3
Training and Compensation
Training Frequency
The training times distribution shows:
- Primary peak at 2-3 sessions
- Secondary peak at 3-4 sessions
- Multiple smaller modes beyond 4 sessions
Monthly Income
Exhibits a heavily right-skewed distribution with:
- Primary peak below median income
- Multiple small peaks in higher income ranges
- Long tail extending into higher income brackets
4.2 Bivariate Analysis: Analyzing Relationships and Dependencies¶
In this section, we explore the relationships between various pairs of variables in the dataset to uncover potential correlations and dependencies, which may help in predicting employee attrition.
Correlation Matrix
The correlation matrix shows the strength and direction of relationships between numerical variables. Values range from -1 to 1:
- A value close to 1 indicates a strong positive correlation (as one variable increases, the other also increases).
- A value close to -1 indicates a strong negative correlation (as one variable increases, the other decreases).
- A value around 0 suggests no correlation between the variables.
For example, if there is a positive correlation between JobSatisfaction and JobLevel, this suggests that employees with higher satisfaction tend to hold higher job levels.
Scatter Plot
The scatter plot visualizes the relationship between two quantitative variables. By plotting data points on a Cartesian plane, it helps to identify trends, clusters, and outliers.
Example: In a YearsAtCompany vs. MonthlyIncome scatter plot, we may see that employees with longer tenures have a broader range of monthly incomes, while those with shorter tenures show more concentrated income values. This pattern could help to identify if employees are more likely to leave based on their income level and time at the company.
Pair Plot
A pair plot is a grid of scatter plots that visualizes pairwise relationships among multiple variables. It also includes histograms to show the distribution of each variable individually. This plot can help identify patterns or groupings of variables that may be correlated with employee attrition.
Example: A pair plot involving Age, JobSatisfaction, and Attrition can reveal that employees who have left (displayed in a different color) cluster in specific areas of the plot, such as older individuals with low job satisfaction. This clustering could suggest that employees who are older and dissatisfied are more likely to leave the company.
By conducting this bivariate analysis, we aim to uncover key relationships between variables that are predictive of employee attrition, allowing for better decision-making in addressing retention strategies.
Correlation Matrix¶
import matplotlib.pyplot as plt
import seaborn as sns
# List of numerical columns
numerical_cols = [
'Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome','TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany'
]
# Correlation matrix heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df[numerical_cols].corr(), annot=True, fmt='.2f', cmap='coolwarm', cbar=True)
plt.title('Correlation Matrix for All Numerical Variables')
plt.show()
Key Insights and Findings
Strong Positive Correlations
Experience and Income Metrics
- Total Working Years shows strong positive correlation with Monthly Income (0.77) and Age (0.68)
- Years at Company correlates significantly with Total Working Years (0.63) and Monthly Income (0.51)
- Age and Monthly Income display moderate positive correlation (0.50)
Weak or No Correlations
Job Satisfaction
- Shows remarkably weak correlations with all other variables (mostly around -0.01 to -0.02)
- Indicates job satisfaction is independent of factors like salary, age, or experience
Work-Life Balance
- Displays minimal correlation with all variables (coefficients near 0)
- Suggests work-life balance is maintained consistently across different employee segments
Distance From Home
- Shows negligible correlations with all variables
- Implies commute distance doesn't influence other workplace factors
Career Development
Training
- Training Times Last Year shows very weak negative correlations with most variables
- Suggests training opportunities are distributed independently of experience or position
Education
- Shows weak positive correlations with Age (0.21) and Total Working Years (0.15)
- Indicates educational level has minimal impact on other career metrics
Key Insights
- Career progression metrics (experience, income, age) are strongly interconnected
- Personal factors (job satisfaction, work-life balance) operate independently
- Training and education show surprising independence from career advancement metrics
- Location and distance factors have minimal impact on other variables
Scatter Plot¶
from itertools import combinations
import math

# List of numerical columns
numerical_cols = [
    'Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome',
    'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
    'YearsAtCompany'
]

# Generate all pairwise combinations of numerical columns
combinations_list = list(combinations(numerical_cols, 2))

# Calculate rows and columns dynamically
n_plots = len(combinations_list)
n_cols = 4                           # Number of columns in the grid
n_rows = math.ceil(n_plots / n_cols) # Required number of rows

# Plot scatter plots for all pairs
plt.figure(figsize=(n_cols * 5, n_rows * 4))  # Adjust size dynamically
for i, (x, y) in enumerate(combinations_list, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.scatterplot(data=df, x=x, y=y, hue='Attrition', alpha=0.6)
    plt.title(f'{x} vs {y}')
    plt.xlabel(x)
    plt.ylabel(y)
plt.tight_layout()
plt.show()
Key Insights and Findings
Age-Related Patterns
Career Progression
- Strong positive linear relationship between age and total working years, indicating natural career progression
- Age shows positive correlation with monthly income, though with high variance
- Years at company increases with age but shows significant scatter, suggesting varied retention patterns
Attrition Risk Factors
- Higher attrition (blue dots) appears more concentrated among younger employees (20-35 age range)
- Employees with longer distance from home show higher attrition across age groups
- Lower income brackets show higher concentration of attrition cases
Education and Training Insights
Educational Impact
- Education levels (1-5) are evenly distributed across age groups
- No clear relationship between education level and monthly income
- Higher education doesn't necessarily correlate with longer company tenure
Training Patterns
- Training frequency remains consistent across age groups
- No clear relationship between training and attrition
- Training distribution is similar regardless of distance from home
Job Satisfaction Analysis
Satisfaction Correlations
- Job satisfaction shows no clear correlation with age or income
- Work-life balance appears consistent across all age groups
- No strong relationship between satisfaction and years at company
Income Distribution
Salary Patterns
- Monthly income shows positive correlation with total working years
- Higher variance in income for employees with longer tenure
- Distance from home shows no clear impact on income levels
Critical Attrition Insights
High-Risk Groups
- Young employees with lower income
- Employees with longer commute distances
- Mid-career professionals with lower-than-expected income progression
Retention Factors
- Work-life balance appears consistent regardless of other factors
- Training opportunities distributed evenly across employee segments
- Job satisfaction varies independently of traditional career metrics
Key Recommendations
- Focus retention strategies on young employees and address commute-related concerns
- Maintain the positive aspects of work-life balance and training distribution
Pair Plot¶
import seaborn as sns
import matplotlib.pyplot as plt
# Select numerical columns for the pair plot
numerical_cols = [
'Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome','TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany'
]
# Create a pair plot with appropriate aesthetics
sns.pairplot(
data=df,
vars=numerical_cols,
hue='Attrition', # Color points by attrition status
palette='coolwarm', # Use a visually clear color palette
diag_kind='kde', # Show KDE on the diagonal
markers=["o", "s"], # Distinguish points by marker
height=2.5 # Adjust the size of each plot
)
# Display the plot
plt.show()
Key Variable Relationships
The pairplot reveals several important patterns regarding IBM employee attrition:
Age and Experience
The scatter plots show that younger employees tend to have higher attrition rates, with the density of "Yes" responses being greater in the lower age ranges. Employees with more years at the company demonstrate lower attrition rates, suggesting that longer tenure correlates with higher retention.
Income Patterns
Monthly income displays a clear relationship with attrition. Lower-income employees show higher attrition rates, while the scatter plots indicate that employees with higher salaries are more likely to stay. This suggests compensation plays a crucial role in retention decisions.
Distance and Work-Life Balance
The distribution plots indicate that employees who live farther from work have increased attrition rates. Work-life balance scores show distinct patterns between those who stay and leave, highlighting its importance in retention.
Education and Job Satisfaction
The visualization reveals that job satisfaction has a notable impact on attrition, with lower satisfaction levels corresponding to higher attrition rates. Education levels show some variation in attrition patterns, though the relationship appears less pronounced than other factors.
Notable Correlations
- A clear correlation exists between years of experience and attrition, with newer employees showing a higher likelihood of leaving.
- The relationship between monthly income and years at the company appears positive, suggesting that longer-tenured employees generally earn more and are less likely to leave.
Recommendations
The data suggests focusing retention strategies on:
- Early-career employees
- Those with lower compensation levels
- Employees with longer commute distances
- Staff showing signs of job dissatisfaction
These insights can help develop targeted retention programs to reduce attrition rates effectively.
4.3 Multivariate Analysis¶
Multivariate analysis examines interactions among three or more variables, providing insights into complex relationships and patterns within the dataset.
Heatmap:
A heatmap visualizes correlations among multiple variables using color gradients to indicate the strength and direction of relationships.
Example: In a correlation heatmap of Age, MonthlyIncome, and YearsAtCompany, Age and MonthlyIncome show a moderate positive correlation (0.50), while YearsAtCompany and MonthlyIncome correlate at 0.51. These correlations help identify key factors related to employee retention and turnover.
Pairwise Relationships:
This technique analyzes relationships among multiple variables by assessing pairs together, revealing clusters and outliers.
Example: A pair plot for Age, MonthlyIncome, YearsAtCompany, and Attrition shows that employees who left tend to cluster in specific areas, highlighting shared characteristics that influence job-change behavior.
Principal Component Analysis (PCA):
PCA reduces dimensionality by transforming the dataset into uncorrelated variables known as principal components, facilitating visualization of data.
Example: Applying PCA to Age, MonthlyIncome, and YearsAtCompany reduces the data to two or three components, which can help distinguish employees who left from those who stayed.
Key Insights:
The analysis of multiple variables uncovers intricate relationships and interactions. For example, PCA helps illustrate how distinct groups of employees—those who exited versus those who stayed—can be differentiated based on their characteristics, thereby assisting in the identification of fundamental factors that influence employee retention and turnover.
1. Heat Map¶
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Select relevant numerical columns for multivariate analysis
numerical_cols = [
'Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome','TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany'
]
# 1. Heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = df[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
plt.title("Correlation Heatmap")
plt.show()
Key Insights and Findings
Strong Positive Correlations
The correlation heatmap reveals several significant relationships:
Experience and Income
- Total Working Years and Monthly Income show a strong positive correlation (0.77)
- Years at Company and Total Working Years display a robust correlation (0.63)
- Monthly Income and Years at Company have a moderate positive correlation (0.51)
Age-Related Patterns
- Age and Total Working Years exhibit a strong positive correlation (0.68)
- Age and Monthly Income show a moderate positive correlation (0.50)
- Age and Years at Company have a weak positive correlation (0.31)
Weak or No Correlations
Work-Life Factors
- Work-Life Balance shows minimal correlation with other variables (all correlations < 0.05)
- Distance From Home has negligible correlations with most variables
- Job Satisfaction demonstrates very weak correlations with all other factors
Notable Insights
The heatmap suggests that:
- Career progression naturally links experience, age, and income
- Work-life balance operates independently of other employment factors
- Job satisfaction appears to be influenced by factors not captured in these variables
- Training times show minimal relationship with other employment metrics
These patterns can inform HR strategies by highlighting the independence of certain factors like work-life balance and job satisfaction from traditional career metrics.
2. Pair Plot¶
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Specify the columns to analyze
columns_to_analyze = [
'Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome',
'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'Attrition'
]
# Filter the DataFrame for the selected columns
analysis_df = df[columns_to_analyze]
### 2. Pair Plot
sns.pairplot(analysis_df, hue='Attrition', palette='husl', diag_kind='kde')
plt.suptitle("Pairwise Relationships", y=1.02)
plt.show()
Key Insights and Findings
Age and Experience Relationships
Attrition by Age:
- Younger employees show higher attrition rates with denser clustering in the lower age ranges.
- Attrition probability decreases as age increases, showing a clear negative correlation.
- The distribution is right-skewed for both attrition groups.
Career Progression:
- Total Working Years and Years at Company show strong positive relationships.
- Employees with longer tenure demonstrate lower attrition rates.
- Experience metrics cluster differently between staying and leaving employees.
Income Patterns
Salary Distribution:
- Monthly Income shows a clear bimodal distribution.
- Higher attrition rates concentrate in lower income brackets.
- Income increases show positive correlation with retention.
Education Impact
- Higher education levels slightly correlate with increased income.
- Education level shows minimal direct impact on attrition.
- Distribution across education levels is relatively uniform.
Work-Life Factors
Distance and Satisfaction:
- Distance from home shows scattered distribution with no clear pattern.
- Job satisfaction levels cluster distinctly between attrition groups.
- Work-life balance scores show minimal correlation with other variables.
Training and Development:
- Training times last year shows discrete distribution.
- No strong correlation between training frequency and attrition.
- Training participation appears independent of other career metrics.
These insights suggest that retention strategies should focus on:
- Early-career support for younger employees, especially those in lower income brackets.
- Competitive compensation to improve retention, particularly for employees in lower income ranges.
- Career development opportunities, particularly for employees with longer tenure, to reduce attrition.
3. Principal Component Analysis¶
### 3. Principal Component Analysis (PCA)
# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(analysis_df.drop(columns=['Attrition']))
# Apply PCA
pca = PCA(n_components=2)
pca_transformed = pca.fit_transform(scaled_data)
# Create a DataFrame for the PCA results
pca_df = pd.DataFrame(data=pca_transformed, columns=['PC1', 'PC2'])
pca_df['Attrition'] = analysis_df['Attrition'].reset_index(drop=True)
# Scatter Plot of PCA Results
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='Attrition', palette='Set2', alpha=0.7)
plt.title("PCA Scatter Plot")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(title="Attrition")
plt.show()
### 4. Insights
# Explained variance for each component
explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance by PCA Components: {explained_variance}")
Explained Variance by PCA Components: [0.30759207 0.11913809]
Key Insights from PCA Analysis
Class Distribution Pattern
The scatter plot reveals significant overlap between employees who left (Yes) and stayed (No) with the company, suggesting that attrition patterns are complex and not easily separable using just two principal components.
Data Clustering Characteristics
There is a notable central clustering of both attrition classes, with data points concentrated around the origin. The similar density patterns between both classes indicate that the chosen PCA dimensions capture general employee characteristics rather than distinctive attrition factors.
Dimensionality Implications
The substantial overlap between classes suggests that employee attrition might be influenced by more complex factors that cannot be fully captured in a two-dimensional representation. Additional dimensions or different feature combinations might be necessary for better separation.
Spatial Distribution
- Principal Component 1 shows a wider spread (ranging from approximately -2 to 6).
- Principal Component 2 has a more concentrated range (approximately -3 to 4).
- Outliers are visible in both components, particularly along PC1's positive axis.
Practical Implications
This visualization indicates that predicting employee attrition cannot rely solely on simple linear combinations of features, suggesting the need for more sophisticated modeling approaches or additional relevant variables to improve predictive accuracy.
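Since the first two components explain only about 43% of the variance (0.31 + 0.12), one natural next step is to fit PCA with all components and read off how many are needed to reach a target such as 80%. A hedged sketch on synthetic standardized data (the real analysis would reuse `scaled_data` from above):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 9 standardized HR features: three base factors,
# three noisy copies of them, and three independent columns
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
noisy_copies = base + rng.normal(scale=0.5, size=(500, 3))
independent = rng.normal(size=(500, 3))
scaled = StandardScaler().fit_transform(np.hstack([base, noisy_copies, independent]))

# Fit PCA with all components and inspect cumulative explained variance
pca = PCA().fit(scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components covering at least 80% of the variance
n_80 = int(np.argmax(cumvar >= 0.80)) + 1
print(cumvar.round(3))
print(f'components for 80% variance: {n_80}')
```

Running the same check on the real `scaled_data` would indicate whether a few extra components meaningfully improve the separation between attrition classes, or whether the overlap persists regardless of dimensionality.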
5 Regression Analysis¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
5.1 Simple Linear Regression¶
df.columns
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
'YearsWithCurrManager'],
dtype='object')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Load your dataset (assuming 'df' is your DataFrame)
# Encode the 'Attrition' column
le = LabelEncoder()
df['Attrition_Encoded'] = le.fit_transform(df['Attrition'])
X = df[['TotalWorkingYears']]
y = df['Attrition_Encoded'] # Use the encoded version of Attrition
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Perform Simple Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on test data
y_pred = model.predict(X_test)
# Evaluate the model performance
r2 = r2_score(y_test, y_pred)
print(f'R²: {r2}')
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'RMSE: {rmse}')
# Visualize the regression line and the residuals
plt.figure(figsize=(10, 6))
# Plotting the regression line
plt.subplot(1, 2, 1)
plt.scatter(X_test, y_test, color='blue', label='Actual data')
plt.plot(X_test, y_pred, color='red', label='Regression Line')
plt.title('Simple Linear Regression')
plt.xlabel('Total Working Years')
plt.ylabel('Attrition (Encoded)')
plt.legend()
# Plotting residuals
residuals = y_test - y_pred
plt.subplot(1, 2, 2)
plt.scatter(X_test, residuals, color='green')
plt.axhline(0, color='black', linewidth=1)
plt.title('Residuals')
plt.xlabel('Total Working Years')
plt.ylabel('Residuals')
plt.tight_layout()
plt.show()
R²: 0.021801909470947067
RMSE: 0.335481416976357
Explanation¶
Model Performance Analysis
The linear regression model analyzing employee attrition against total working years shows several key insights:
Statistical Metrics
- The R² value of 0.022 (2.2%) indicates that only a very small portion of the variance in attrition is explained by total working years.
- The RMSE of 0.335 suggests moderate prediction errors in the model.
Visual Interpretation
Scatter Plot Analysis
- The left plot shows a binary distribution of attrition (0 and 1).
- Blue dots represent actual data points with clear clustering at 0 (no attrition) and 1 (attrition).
- The red regression line shows a slight negative slope, suggesting a weak negative correlation.
Residuals Pattern
- The residuals plot (right) shows a systematic pattern.
- Residuals are not randomly scattered, indicating the linear model may not be the best fit.
- There's a visible trend in residuals across working years, suggesting potential non-linear relationships.
Business Insights
Key Findings
- The negative slope suggests that employees with more working years are slightly less likely to leave.
- The model's low R² value indicates that working years alone is not a strong predictor of attrition.
- The binary nature of attrition (0/1) suggests that logistic regression might be more appropriate than linear regression.
5.2 Multiple Linear Regression¶
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
# Load your dataset (assuming 'df' is your DataFrame, with 'Attrition_Encoded' created in 5.1)
# Define independent variables (numerical predictors)
predictor_cols = [
    'Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome',
    'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany'
]
# Define dependent variable (target): the 0/1-encoded attrition flag
y = df['Attrition_Encoded']
# Define independent variables (X)
X = df[predictor_cols]
# Add constant term to X for intercept in statsmodels
X = sm.add_constant(X)
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 1. Perform Multiple Linear Regression using sklearn
model = LinearRegression()
model.fit(X_train, y_train)
# 2. Evaluate the model performance on the test data
y_pred = model.predict(X_test)
# R² (R-squared)
r2 = r2_score(y_test, y_pred)
print(f'R²: {r2}')
# RMSE (Root Mean Squared Error)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'RMSE: {rmse}')
# 3. Significance of each predictor variable using statsmodels
# Train the model using statsmodels (for p-values and coefficient analysis)
X_train_sm = sm.add_constant(X_train)  # No-op if the constant column is already present
sm_model = sm.OLS(y_train, X_train_sm).fit()
# Print the summary to see p-values and coefficients
print(sm_model.summary())
Explanation¶
Model Performance
The multiple linear regression model shows modest improvement over the simple linear regression:
- R² value increased to 0.060 (6%), indicating the model explains 6% of variance in attrition.
- RMSE of 0.329 shows slightly improved prediction accuracy.
- F-statistic of 8.170 with p-value 7.68e-12 indicates the model is statistically significant.
Significant Predictors
Strong Predictors (p < 0.05):
- JobSatisfaction: Negative coefficient (-0.0378) shows higher satisfaction reduces attrition.
- WorkLifeBalance: Negative coefficient (-0.0354) indicates better work-life balance decreases attrition.
- Age: Negative coefficient (-0.0034) suggests older employees are less likely to leave.
- DistanceFromHome: Positive coefficient (0.0030) shows longer commutes increase attrition.
- TrainingTimesLastYear: Negative coefficient (-0.0165) indicates more training reduces attrition.
Non-Significant Predictors (p > 0.05):
- Education (p = 0.717)
- MonthlyIncome (p = 0.181)
- TotalWorkingYears (p = 0.505)
- YearsAtCompany (p = 0.188)
Model Diagnostics
- Durbin-Watson statistic of 2.104 suggests minimal autocorrelation.
- High Jarque-Bera statistic (573.732) indicates non-normal residuals.
- Skewness of 1.629 shows positive skew in residuals.
- Condition number of 277 suggests acceptable multicollinearity.
5.3 Polynomial Regression (degree = 3)¶
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
# Step 1: Prepare the Data
X = df[['TotalWorkingYears']] # Keep it as a DataFrame
y = df['Attrition_Encoded'].values # Use the 0/1-encoded attrition flag from 5.1
# Step 2: Apply Polynomial Features (degree = 3, you can change degree based on your preference)
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
# Step 3: Fit the Polynomial Regression Model
poly_model = LinearRegression()
poly_model.fit(X_poly, y)
# Step 4: Make Predictions
y_poly_pred = poly_model.predict(X_poly)
# Step 5: Evaluate the Polynomial Regression Model
r2_poly = r2_score(y, y_poly_pred)
rmse_poly = np.sqrt(mean_squared_error(y, y_poly_pred))
mae_poly = mean_absolute_error(y, y_poly_pred)
print(f"Polynomial Regression - R²: {r2_poly}")
print(f"Polynomial Regression - RMSE: {rmse_poly}")
print(f"Polynomial Regression - MAE: {mae_poly}")
# Visualize the simple linear regression line
plt.scatter(X, y, color='blue', label='Actual values')
# Predict using simple linear regression (just for comparison)
linear_model = LinearRegression()
linear_model.fit(X, y)
y_linear_pred = linear_model.predict(X)
plt.plot(X, y_linear_pred, color='red', label='Linear Regression', linewidth=2)
# Visualize the polynomial regression curve
X_sorted = np.sort(X.values, axis=0) # Sorting for better curve plotting
y_poly_sorted = poly_model.predict(poly.transform(X_sorted))
plt.plot(X_sorted, y_poly_sorted, color='green', label='Polynomial Regression', linewidth=2)
plt.title('Polynomial vs Linear Regression')
plt.xlabel('Total Working Years')
plt.ylabel('Attrition')
plt.legend()
plt.show()
Explanation and Comparison with Linear Regression¶
Model Performance Comparison
Polynomial Regression Metrics
- R² score: 0.0551 (5.51%)
- RMSE: 0.3575
- MAE: 0.2556
Linear Regression Metrics
- R² score: 0.022 (2.2%)
- RMSE: 0.335
Visual Analysis
Data Distribution
- The actual values (blue dots) show a clear binary pattern at 0 and 1, representing no attrition and attrition, respectively.
- The data points are distinctly clustered at these two levels, indicating the categorical nature of the target variable.
Regression Curves
- The linear regression line (red) shows a simple negative slope.
- The polynomial regression curve (green) shows more flexibility in fitting the data, with a curved pattern that better follows the data distribution. The polynomial curve shows higher values at the beginning, drops in the middle, and slightly rises again towards the end.
Comparative Insights
Model Fit
- The polynomial regression shows a slightly better fit with an R² of 0.0551 compared to the linear regression's 0.022.
- Both models have relatively high RMSE values, indicating significant prediction errors.
- The polynomial model's curved nature better captures the non-linear relationship in the data.
Limitations
- Both models perform poorly overall, with R² values below 0.06.
- The binary nature of the target variable suggests that neither linear nor polynomial regression is ideal for this classification problem.
- A logistic regression or other classification algorithms would be more appropriate for this binary outcome.
Conclusion
- While the polynomial regression shows marginally better performance metrics than linear regression, neither model is particularly effective at predicting employee attrition.
- The binary nature of attrition suggests that this problem would be better approached as a classification task rather than a regression problem.
5.4 Logistic Regression¶
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'df' is the DataFrame and 'Attrition' is the target variable
# Prepare the features (X) and target (y)
X = df.drop(columns=['Attrition'])
y = df['Attrition']
# Convert categorical variables to numerical (one-hot encoding)
X = pd.get_dummies(X, drop_first=True)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features for logistic regression
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Initialize and fit the logistic regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
# Make predictions
y_pred = log_reg.predict(X_test)
# Calculate accuracy, precision, recall, and ROC-AUC
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])
# Print the evaluation metrics
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'ROC-AUC: {roc_auc:.4f}')
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Greens', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Explanation¶
Performance Analysis
Classification Metrics
- The model achieved a high accuracy of 87.07%, indicating good overall prediction performance.
- Precision of 0.5455 shows that when the model predicts attrition, it is correct about 54.55% of the time.
- Recall of 0.3934 indicates the model only identifies about 39.34% of actual attrition cases.
- ROC-AUC score of 0.8065 suggests good discriminative ability between classes.
Confusion Matrix Insights
True Predictions
- True Negatives (360): The model correctly identified 360 employees who stayed.
- True Positives (24): Successfully predicted 24 cases of attrition.
Misclassifications
- False Positives (20): Incorrectly predicted 20 employees would leave when they actually stayed.
- False Negatives (37): Failed to identify 37 actual attrition cases.
Model Evaluation
Strengths
- High accuracy indicates good overall performance.
- Strong ROC-AUC score shows good class separation capability.
- Excellent at predicting employees who will stay (high true negatives).
Limitations
- Lower recall suggests the model misses many actual attrition cases.
- Moderate precision indicates some reliability issues in positive predictions.
- Class imbalance evident in the confusion matrix, with many more non-attrition cases.
Conclusion
- The logistic regression model shows significant improvement over the previous polynomial and linear regression approaches.
- While it performs well in overall accuracy and class separation, it struggles with identifying actual attrition cases.
- The model is better at predicting who will stay rather than who will leave, which could be attributed to the class imbalance in the dataset.
- For business applications, the model could be useful as a preliminary screening tool, but the lower recall suggests that it shouldn't be the sole decision-making factor for attrition prediction.
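One common, low-cost response to the low recall noted above is to reweight the classes during training. This is a hedged sketch on synthetic imbalanced data (a stand-in for the HR dataset, roughly matching its ~16% attrition rate) using scikit-learn's `class_weight='balanced'` option:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the attrition data: ~15% positive class
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.85, 0.15],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42,
                                          stratify=y)

# Default model vs. a model that upweights the minority (attrition) class
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)

rec_plain = recall_score(y_te, plain.predict(X_te))
rec_balanced = recall_score(y_te, balanced.predict(X_te))
print(f"Recall (default): {rec_plain:.3f}  Recall (balanced): {rec_balanced:.3f}")
```

Class weighting typically raises recall at some cost to precision; resampling with SMOTE, used in Section 6, is an alternative route to the same goal.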
5.5.1 Regularized Regression (LASSO, Ridge, Elastic Net)¶
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
# Define the features and target variable
X = df[['Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome',
'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany']] # Features (the target 'Attrition' must not be included here)
y = df['Attrition'] # Target variable
# Split the data into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features (important for regularization models)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize the models with regularization parameters
lasso = Lasso(alpha=0.1) # Lasso with L1 penalty
ridge = Ridge(alpha=0.1) # Ridge with L2 penalty
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5) # ElasticNet with both L1 and L2 penalties
# Train the models
lasso.fit(X_train_scaled, y_train)
ridge.fit(X_train_scaled, y_train)
elastic_net.fit(X_train_scaled, y_train)
# Make predictions on the test set
y_pred_lasso = lasso.predict(X_test_scaled)
y_pred_ridge = ridge.predict(X_test_scaled)
y_pred_elastic_net = elastic_net.predict(X_test_scaled)
# Evaluate the models using Regression Metrics: R², RMSE, and MAE
def evaluate_model(y_true, y_pred):
r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred)) # RMSE (Root Mean Squared Error)
mae = mean_absolute_error(y_true, y_pred) # MAE (Mean Absolute Error)
return r2, rmse, mae
# Evaluate each model
lasso_r2, lasso_rmse, lasso_mae = evaluate_model(y_test, y_pred_lasso)
ridge_r2, ridge_rmse, ridge_mae = evaluate_model(y_test, y_pred_ridge)
elastic_net_r2, elastic_net_rmse, elastic_net_mae = evaluate_model(y_test, y_pred_elastic_net)
# Display the results
print("LASSO Regression - R²:", lasso_r2, "RMSE:", lasso_rmse, "MAE:", lasso_mae)
print("Ridge Regression - R²:", ridge_r2, "RMSE:", ridge_rmse, "MAE:", ridge_mae)
print("Elastic Net Regression - R²:", elastic_net_r2, "RMSE:", elastic_net_rmse, "MAE:", elastic_net_mae)
Explanation¶
Performance Analysis
Ridge Regression
- Shows exceptional performance with an R² of nearly 1.0 (0.9999).
- Extremely low RMSE (2.97e-05) and MAE (2.16e-05).
- The L2 penalty has effectively handled multicollinearity while maintaining predictive power.
LASSO Regression
- Strong performance with R² of 0.9278.
- Higher error metrics with RMSE of 0.091 and MAE of 0.069.
- The L1 penalty has performed feature selection while maintaining good accuracy.
Elastic Net Regression
- Very good performance with R² of 0.9691.
- Moderate error metrics with RMSE of 0.060 and MAE of 0.045.
- Combines benefits of both L1 and L2 regularization.
Conclusion
- Ridge Regression appears to be the best performer, but an R² of essentially 1.0 is a red flag rather than a triumph: these metrics were produced with the target variable Attrition inadvertently included among the predictors, so the scores reflect data leakage rather than genuine predictive power. The models should be re-fit with the target excluded from the feature set, and substantially lower scores should be expected.
- Methodologically, the Elastic Net remains an attractive choice for this problem, as it combines LASSO's feature selection (L1 penalty) with Ridge's handling of multicollinearity (L2 penalty).
5.5.2 Advanced Regression Techniques¶
5.5.2.1. Quantile Regression¶
Quantile Regression is used when you are interested in predicting specific quantiles of the conditional distribution of the dependent variable, rather than the mean (like in ordinary least squares regression). It allows you to understand the distributional properties of the data and is useful when the data is not normally distributed.
Key Use Case:¶
- Predicting the 25th, 50th, or 75th percentiles instead of just the mean value (e.g., median prediction for robust regression).
from statsmodels.regression.quantile_regression import QuantReg
import numpy as np
# Assume X_train and y_train are available; add a constant (sm.add_constant) if an intercept is desired
# Fit the quantile regression model for the median (quantile = 0.5)
model = QuantReg(y_train, X_train)
quantile_model = model.fit(q=0.5)
# Print summary
print(quantile_model.summary())
# Predict quantile (for example, median)
predictions = quantile_model.predict(X_test)
Explanation
Quantile Regression
- Significant Positive Effects:
- Age, JobSatisfaction, and MonthlyIncome positively influenced attrition at the median level.
- Significant Negative Effects:
- DistanceFromHome, TotalWorkingYears, TrainingTimesLastYear, and YearsAtCompany reduced attrition likelihood.
5.5.2.2. Poisson Regression¶
Poisson regression is used for count data, particularly when the dependent variable is a count (non-negative integer values) and follows a Poisson distribution. It assumes the mean is equal to the variance and is used when you want to model event counts in a fixed interval of time or space.
Key Use Case:¶
- Modeling the number of occurrences of an event, such as the number of calls received by a call center.
import statsmodels.api as sm
import pandas as pd
# Example with count data
# Assume X_train and y_train
model = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
# Print summary
print(model.summary())
# Predict values
predictions = model.predict(X_test)
Explanation
Poisson Regression
- Significant Predictors:
- Age, JobSatisfaction, TrainingTimesLastYear, and WorkLifeBalance had significant negative relationships with attrition.
- DistanceFromHome showed a small positive effect on attrition.
5.5.2.3. Negative Binomial Regression¶
Negative Binomial regression is used when count data exhibits overdispersion, meaning the variance is greater than the mean. It is a generalization of Poisson regression.
Key Use Case:¶
- When modeling overdispersed count data, such as number of accidents or disease occurrences, where the variance is larger than the mean.
from statsmodels.genmod.generalized_linear_model import GLM
import statsmodels.api as sm # needed for sm.families.NegativeBinomial below
import pandas as pd
# Negative binomial regression assumes overdispersion
model = GLM(y_train, X_train, family=sm.families.NegativeBinomial()).fit()
# Print summary
print(model.summary())
# Predict values
predictions = model.predict(X_test)
Explanation
Negative Binomial Regression
- Significant Negative Predictors:
- Age, Education, JobSatisfaction, TrainingTimesLastYear, and WorkLifeBalance.
- MonthlyIncome had a marginally significant positive effect.
5.5.2.4. Zero-Inflated and Hurdle Regression¶
Zero-Inflated models are used for modeling count data with an excessive number of zero counts (e.g., zero-inflated Poisson or zero-inflated negative binomial). Hurdle models are similar but model the zero and non-zero outcomes separately, typically using a truncated distribution for positive counts.
Key Use Case:¶
- Modeling data where many zeros are present, like the number of hospital visits, purchases of a product, or other events where zero counts are frequent.
from statsmodels.discrete.count_model import ZeroInflatedPoisson
import pandas as pd
# Fit zero-inflated Poisson regression model
model = ZeroInflatedPoisson(y_train, X_train).fit()
# Print summary
print(model.summary())
# Predict values
predictions = model.predict(X_test)
from sklearn.linear_model import LogisticRegression
from statsmodels.genmod.families import Poisson
import statsmodels.api as sm # needed for sm.GLM below
# Logistic regression for zero vs non-zero
logit_model = LogisticRegression().fit(X_train, y_train == 0)
# Fit count regression on positive counts
count_model = sm.GLM(y_train[y_train > 0], X_train[y_train > 0], family=Poisson()).fit()
Explanation
Zero-Inflated Poisson Regression
- Identified significant predictors:
- Age, Education, JobSatisfaction, MonthlyIncome, TrainingTimesLastYear, and WorkLifeBalance.
- Improved Model Fit:
- This model captured the excess zeros in the data better than standard Poisson regression.
5.5.2.5. Cox Regression (Cox Proportional Hazards Model)¶
Cox regression is used for survival analysis, where you are interested in predicting the time to an event (e.g., time to failure, survival time, etc.). It is used for modeling the hazard rate (the risk of an event occurring at a given time).
Key Use Case:¶
- Modeling survival times in medical research, such as predicting the time until a patient relapses after treatment.
!pip install lifelines
from lifelines import CoxPHFitter
import pandas as pd
# Sample data for demonstration
# Define the duration, event, and covariates
duration = [5, 6, 6, 2, 4] # Example durations
event = [1, 0, 1, 1, 0] # Example event indicators (1 = event occurred, 0 = censored)
X1 = [10, 20, 10, 30, 20] # Example covariate 1
X2 = [1, 2, 1, 2, 1] # Example covariate 2
# Create a DataFrame with the defined variables
data = pd.DataFrame({'duration': duration, 'event': event, 'X1': X1, 'X2': X2})
# Fit Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(data, duration_col='duration', event_col='event')
# Print summary
cph.print_summary()
# Predict hazard ratio or survival function
# Assuming X_test is defined similarly to the covariates in the data
X_test = pd.DataFrame({'X1': [15], 'X2': [1]}) # Example test data
predictions = cph.predict_partial_hazard(X_test)
Explanation
Cox Proportional Hazards Regression
- Concordance: 0.79, indicating strong predictive ability.
- Coefficients: Not statistically significant in the example, possibly due to a small sample size.
5.5.2.6. Partial Least Squares Regression (PLSR)¶
Partial Least Squares Regression is a technique for regression with high-dimensional data, where the number of predictors exceeds the number of observations. It finds the directions (latent variables) in the data that explain the variance in both predictors and responses.
Key Use Case:¶
- When you have a large number of predictors and limited observations, such as in genomics or chemometrics.
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
# Define the features and target variable
X = df[['Age', 'DistanceFromHome', 'Education', 'JobSatisfaction', 'MonthlyIncome', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany']]
y = df['Attrition'] # Target variable
# Split the data into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit Partial Least Squares regression with 5 latent components
pls_model = PLSRegression(n_components=5)
pls_model.fit(X_train, y_train)
# Predict values
predictions = pls_model.predict(X_test)
print(predictions)
Explanation
Partial Least Squares Regression (PLSR)
- Combined predictors like Age, JobSatisfaction, TrainingTimesLastYear, and others.
- Predictions ranged from negative values to about 0.32; the negative outputs underscore that a linear latent-variable model is not constrained to the [0, 1] range expected of a probability.
5.5.2.7. Principal Component Regression (PCR)¶
Principal Component Regression combines Principal Component Analysis (PCA) with linear regression. PCA reduces the dimensionality of the predictors, and linear regression is performed on the reduced dimensions.
Key Use Case:¶
- When you have a high-dimensional dataset and want to reduce the dimensionality to improve model interpretability and avoid overfitting.
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
# Assume X_train, y_train is your dataset
pca = PCA(n_components=5)
reg = LinearRegression()
# Create a pipeline combining PCA and linear regression
pcr_model = make_pipeline(pca, reg)
pcr_model.fit(X_train, y_train)
# Predict values
predictions = pcr_model.predict(X_test)
print(predictions)
Explanation
Principal Component Regression (PCR)
- Used 5 principal components.
- Predictions matched those reported for PLSR; note that in the original run both sections used the same PCA-plus-linear-regression pipeline, so the identical outputs reflect the duplicated code rather than a property of the data.
Conclusion on Advanced Regression Techniques¶
Key Predictors:
- Across all models, Age, JobSatisfaction, WorkLifeBalance, and TrainingTimesLastYear emerged as consistent factors influencing attrition.
- Models suggest focusing on job satisfaction improvements, work-life balance, and training opportunities to reduce attrition.
Zero-Inflated Poisson Insights:
- Revealed excess zeros in the data, indicating employees highly unlikely to leave. This finding suggests a need for specialized retention strategies.
Dimensionality Reduction Techniques (PLSR & PCR):
- Addressed multicollinearity effectively and identified complex interactions among predictors.
- These methods are valuable for robust attrition analysis when dealing with correlated variables.
Practical Implications for IBM:
- Develop targeted retention programs for employees in specific age or tenure groups.
- Leverage advanced models like Zero-Inflated Poisson and PLSR/PCR for more nuanced insights.
- Continuously monitor and address predictors like distance, satisfaction, and income to improve retention rates.
These advanced regression techniques provide a deeper understanding of attrition dynamics, enabling IBM to design tailored interventions for workforce stability.
6. Selected Model Evaluation and Validation¶
logit = LogisticRegression(max_iter=1000)
df.head(1)
df['TotalWorkingYears'].plot(kind='hist')
from scipy import stats
# Perform the Kolmogorov-Smirnov test
d_statistic, pval = stats.kstest(df['TotalWorkingYears'], 'norm', args=(df['TotalWorkingYears'].mean(), df['TotalWorkingYears'].std()))
if pval < 0.05:
print('not normal')
else:
print('normal')
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer # Importing SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder # Importing necessary preprocessing classes
from sklearn.decomposition import PCA # Importing PCA
from sklearn.compose import ColumnTransformer # Importing ColumnTransformer
from sklearn.feature_selection import RFE # Importing RFE
from imblearn.over_sampling import SMOTE # Importing SMOTE for resampling
import numpy as np # Importing numpy for handling NaN values
# Define SMOTE instance
smote = SMOTE() # Create an instance of SMOTE
# Define one_hot_cols as needed for your categorical columns
one_hot_cols = [] # Replace with your actual categorical column names
# for MonthlyIncome and TotalWorkingYears
logit_pipe_num = Pipeline([
('imputer', SimpleImputer(strategy='median', missing_values=np.nan)),
('scaler', StandardScaler()),
('pca', PCA(n_components=1))
])
# for all object columns
logit_pipe_cat = Pipeline([
('onehot', OneHotEncoder(drop='first')),
])
# transforming all columns
logit_transformer = ColumnTransformer([
('pipe_num', logit_pipe_num, ['MonthlyIncome', 'TotalWorkingYears']),
('pipe_cat', logit_pipe_cat, one_hot_cols)
])
# combine all pipelines; SMOTE is a resampler, so imblearn's Pipeline is required (sklearn's Pipeline would raise an error)
from imblearn.pipeline import Pipeline as ImbPipeline
logit_pipe_combine = ImbPipeline([
('transformer', logit_transformer),
('rfe', RFE(logit)),
('resampling', smote),
('logit', logit)
])
from sklearn.metrics import get_scorer_names
# Get the available scorer names
scorers = get_scorer_names()
# Print the list of scorers
print(scorers)
K Fold¶
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from imblearn.pipeline import Pipeline as ImbPipeline # Corrected import
# Define your pipeline using ImbPipeline
logit_pipe_combine = ImbPipeline(steps=[
('scaler', StandardScaler()), # Scaling step
('smote', SMOTE(random_state=2021)), # SMOTE step for oversampling
('logit', LogisticRegression()) # Logistic Regression step
])
# Define your RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=2021)
# Cross-validation with recall as the scoring metric
logit_score = cross_val_score(logit_pipe_combine, X_train, y_train, scoring='recall', cv=rskf, n_jobs=-1, verbose=1)
# Print the scores
print(logit_score)
print('Logit Val Score (mean):', logit_score.mean())
logit_pipe_combine.get_params()
from sklearn.model_selection import GridSearchCV
# Define the hyperparameters grid to tune
param_grid = {
'logit__C': [0.1, 1, 10], # Regularization strength
'logit__solver': ['liblinear', 'saga'] # Solvers for logistic regression
}
# Perform grid search with cross-validation
logit_grid = GridSearchCV(logit_pipe_combine, param_grid, scoring='recall', cv=5, n_jobs=-1, verbose=1)
# Fit the grid search on the training data
logit_grid.fit(X_train, y_train)
logit_tuned = logit_grid.best_estimator_
logit_tuned_score = cross_val_score(logit_tuned, X_train, y_train, scoring='recall', cv=rskf, n_jobs=-1, verbose=1)
logit_tuned_score
plt.plot(np.arange(len(logit_tuned_score)), logit_tuned_score)
plt.ylim(0,1)
plt.show()
logit_tuned_score.mean() # mean recall of the tuned model from cross-validation
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# Fit the tuned model
logit_tuned.fit(X_train, y_train)
# Predict probabilities for the positive class
y_probs = logit_tuned.predict_proba(X_test)[:, 1]
# Compute precision-recall curve
precision, recall, _ = precision_recall_curve(y_test, y_probs)
# Plot precision-recall curve
plt.figure(figsize=(6, 5))
plt.plot(recall, precision, color='b', label='Precision-Recall curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='best')
plt.show()
proba1 = logit_tuned.predict_proba(X_test)[:,1]
y_pred = logit_tuned.predict(X_test)
thresh = 0.341275
pred_03 = np.where(proba1 > thresh, 1, 0)
res_df = pd.DataFrame({'proba1': proba1, 'y_pred': y_pred, 'y_pred03': pred_03})
plt.figure(figsize=(16,8))
plt.subplot(121)
sns.scatterplot(x=range(len(res_df)), y=res_df['proba1'], hue=res_df['y_pred03'])
plt.axhline(thresh, color='red')
plt.subplot(122)
sns.scatterplot(x=range(len(res_df)), y=res_df['proba1'], hue=res_df['y_pred'])
plt.axhline(0.5, color='red')
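The mechanics of the lowered cutoff can be isolated: moving the threshold below 0.5 can only add predicted positives, so recall never decreases (while precision usually does). A minimal sketch on hypothetical probabilities and labels (not the model's actual outputs):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities and true labels
proba = np.array([0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.30, 0.60, 0.45, 0.10])
y_true = np.array([0,    0,    1,    0,    1,    1,    0,    1,    1,    0])

def preds_at(p, thresh):
    """Binarize probabilities at a given decision threshold."""
    return (p > thresh).astype(int)

# Compare the default cutoff with the lowered one used above
for t in (0.5, 0.341275):
    yp = preds_at(proba, t)
    print(f"t={t}: recall={recall_score(y_true, yp):.2f} "
          f"precision={precision_score(y_true, yp):.2f}")
```

The threshold would normally be chosen from the precision-recall curve above to match the business cost of missing a leaver versus flagging a stayer.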
Model Evaluation and Comparison of All Models¶
Evaluation Metrics for All Models¶
Regression Models¶
- Simple Linear Regression:
- R²: 0.0218
- RMSE: 0.3355
- Multiple Linear Regression:
- R²: 0.0602
- RMSE: 0.3288
- Polynomial Regression:
- R²: 0.0551
- RMSE: 0.3575
- MAE: 0.2556
Classification Model¶
- Logistic Regression:
- Accuracy: 0.8707
- Precision: 0.5455
- Recall: 0.3934
- ROC-AUC: 0.8065
Comparison of Model Performance¶
Regression Models¶
- Multiple Linear Regression:
- Slightly outperforms Simple Linear Regression with a higher R² (0.0602 vs. 0.0218) and lower RMSE (0.3288 vs. 0.3355).
- Polynomial Regression:
- Shows poorer performance than both linear models with a lower R² (0.0551) and higher RMSE (0.3575).
Classification Model¶
- Logistic Regression demonstrates strong performance for classification tasks:
- High accuracy (87.07%).
- Good ROC-AUC score (0.8065), indicating strong discriminative ability.
- However, precision (0.5455) and recall (0.3934) suggest limitations in identifying positive cases of attrition.
Overall Comparison¶
- Regression Models:
- Struggle to explain variance in employee attrition, with all R² values below 0.1.
- Logistic Regression:
- Outperforms regression models for classification, making it a more suitable approach for predicting attrition.
7. Results and Interpretation¶
Summary of Findings¶
- All regression models demonstrate low explanatory power for employee attrition.
- Multiple Linear Regression slightly outperforms other regression models but still explains only 6.02% of the variance in attrition.
- Logistic Regression shows strong overall performance for classification, with:
- Accuracy: 87.07%
- ROC-AUC: 0.8065
- Precision and recall for Logistic Regression indicate a trade-off between correctly identifying attrition cases and capturing all actual attrition instances.
- Poor regression performance suggests complex, non-linear relationships between variables and attrition.
Insights and Implications¶
Complex Attrition Factors:
- Employee attrition is influenced by factors not fully captured by current variables.
Classification Suitability:
- A classification approach may be more suitable for predicting employee attrition than regression methods.
Cautious Use of Logistic Regression:
- Logistic Regression should be used cautiously, perhaps as a screening tool rather than a definitive predictor.
Feature Relevance:
- Certain features may be more significant for predicting attrition when considered in a binary context rather than a continuous one.
Improvement Strategies:
- Exploring non-linear techniques, engineering new features, adding relevant variables, and addressing class imbalance.
Model Interpretability and Further Improvements¶
Interpretability Techniques¶
- Intrinsic Analysis:
- Starting with simple, interpretable models like linear regression or decision trees.
- Post Hoc Analysis:
- Using LIME (Local Interpretable Model-agnostic Explanations) or SHAP (SHapley Additive exPlanations) for explaining individual predictions.
- Global Surrogate Models:
- Training interpretable models to approximate the behavior of complex black-box models.
- Feature Importance:
- Using permutation feature importance or embedded methods to identify influential features.
- Partial Dependence Plots:
- Visualizing the relationship between features and the target variable, accounting for average effects of other features.
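Permutation importance, for instance, is available directly in scikit-learn's `inspection` module; a hedged sketch on synthetic data (the API calls are real, the data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data with a few informative features
X, y = make_classification(n_samples=800, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in held-out accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f}")
```

Applied to the fitted attrition model, the same call would rank features such as JobSatisfaction or MonthlyIncome by how much held-out performance degrades when they are scrambled.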
Advanced Techniques¶
- Feature Engineering:
- Creating interaction or polynomial features to capture complex relationships.
- Domain Knowledge Integration:
- Incorporating industry-specific insights to craft relevant features.
- Ensemble Methods:
- Exploring ensemble techniques or deep learning approaches for better performance while balancing interpretability.
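As one concrete ensemble option, a random forest can be dropped in with the same scikit-learn interface used throughout this report; a sketch on synthetic imbalanced data (illustrative only, not a claim about performance on the HR dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a minority positive class, as in the attrition data
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.85, 0.15],
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7,
                                          stratify=y)

# class_weight='balanced' counters the imbalance noted for logistic regression
rf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                            random_state=7).fit(X_tr, y_tr)
acc = accuracy_score(y_te, rf.predict(X_te))
print(f"Accuracy: {acc:.3f}")
# Built-in impurity-based importances give a first (if imperfect) view of the drivers
print(rf.feature_importances_)
```

The trade-off is interpretability: the forest's impurity-based importances are coarser than the coefficient-level insight logistic regression provides, which is where the post hoc tools above (SHAP, LIME) come in.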
Employee Behavior and Attrition Prediction¶
Segmentation Strategies¶
Demographic Segmentation:
- Group employees by Age, Gender, TotalWorkingYears, and MonthlyIncome.
- Analyze attrition rates across different demographic groups to identify high-risk segments.
Behavioral Segmentation:
- Categorize employees based on engagement levels, performance metrics, and interaction patterns.
- Identify behavioral indicators like reduced participation, decreased productivity, or increased absenteeism.
Lifecycle Segmentation:
- Segment employees by TotalWorkingYears and JobLevel within the company.
- Tailor retention strategies based on attrition patterns in different lifecycle stages.
Feedback and Satisfaction Segmentation:
- Group employees based on satisfaction scores from surveys or performance reviews.
- Track changes in satisfaction levels over time to predict potential attrition.
Implications¶
By leveraging segmentation approaches, you can develop more targeted and effective strategies to predict and prevent employee attrition. This granular approach allows for identifying specific risk factors within each segment, enabling personalized retention efforts.
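Demographic segmentation of this kind is straightforward with pandas; a sketch assuming a frame with an `Age` column and a 0/1 `Attrition` column (the miniature data here is hypothetical, not drawn from the dataset):

```python
import pandas as pd

# Hypothetical miniature of the HR frame
hr = pd.DataFrame({
    'Age':       [22, 25, 31, 34, 41, 45, 52, 58, 29, 38],
    'Attrition': [1,  1,  0,  1,  0,  0,  0,  0,  1,  0],
})

# Bin ages into bands and compute the attrition rate per segment
hr['AgeBand'] = pd.cut(hr['Age'], bins=[18, 30, 40, 50, 60],
                       labels=['18-30', '31-40', '41-50', '51-60'])
rates = hr.groupby('AgeBand', observed=True)['Attrition'].mean()
print(rates)
```

The same `pd.cut` + `groupby` pattern extends to tenure bands, income quartiles, or satisfaction scores, giving per-segment attrition rates on which to target retention efforts.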
Conclusion¶
To conclude, this project demonstrated how logistic regression, coupled with regularization techniques, can be leveraged to predict employee attrition. By integrating regularization, we mitigated the risk of overfitting and improved the model's ability to generalize to new data. The findings offer valuable insights for organizations aiming to reduce attrition and improve their employee retention strategies. This approach can be extended by adding more features, employing more advanced models, and refining employee segmentation to enhance decision-making.
From the analysis of the factors affecting employee attrition, the results provide sufficient evidence to reject the null hypothesis. Therefore, we conclude that there is a significant relationship between employee attributes (such as age, job satisfaction, monthly income, distance from home, work-life balance, and performance rating) and employee attrition.
References¶
DataCamp. "An Introduction to Exploratory Data Analysis." Available at: https://www.datacamp.com/community/tutorials/exploratory-data-analysis-python
Seaborn Documentation. "Seaborn Overview." Available at: https://seaborn.pydata.org/
Matplotlib Documentation. "Matplotlib User Guide." Available at: https://matplotlib.org/stable/users/index.html
Medium. "A Comprehensive Guide to Exploratory Data Analysis (EDA)." Available at: https://medium.com/
Kaggle. "IBM HR Analytics Employee Attrition & Performance Dataset." Available at: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
Towards Data Science. "Exploratory Data Analysis (EDA) Visualization Using Pandas." Available at: https://towardsdatascience.com/